RFC: Introduce pandas.col #62103

Open · MarcoGorelli wants to merge 4 commits into main

Conversation

@MarcoGorelli (Member) commented Aug 13, 2025

xref @jbrockmendel's comment #56499 (comment)

I'd also discussed this with @phofl, @WillAyd, and @jorisvandenbossche (who originally showed us something like this in Basel at EuroSciPy 2023).

Demo:

import pandas as pd
from datetime import datetime

df = pd.DataFrame(
    {
        "a": [1, -2, 3],
        "b": [4, 5, 6],
        "c": [datetime(2020, 1, 1), datetime(2025, 4, 2), datetime(2026, 12, 3)],
        "d": ["fox", "beluga", "narwhal"],
    }
)

result = df.assign(
    # The usual Series methods are supported
    a_abs=pd.col("a").abs(),
    # And can be combined
    a_centered=pd.col("a") - pd.col("a").mean(),
    a_plus_b=pd.col("a") + pd.col("b"),
    # Namespaces are supported too
    c_year=pd.col("c").dt.year,
    c_month_name=pd.col("c").dt.strftime("%B"),
    d_upper=pd.col("d").str.upper(),
).loc[pd.col("a_abs") > 1]  # This works in `loc` too

print(result)

Output:

   a  b          c        d  a_abs  a_centered  a_plus_b  c_year c_month_name  d_upper
1 -2  5 2025-04-02   beluga      2   -2.666667         3    2025        April   BELUGA
2  3  6 2026-12-03  narwhal      3    2.333333         9    2026     December  NARWHAL
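
For comparison, an expression is simply evaluated against the DataFrame it is passed to, so each assignment above has a lambda equivalent today; pd.col mostly removes the boilerplate:

# Two spellings of the same assignment: the existing lambda form and the proposed expression form
df.assign(a_centered=lambda df: df["a"] - df["a"].mean())
df.assign(a_centered=pd.col("a") - pd.col("a").mean())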

Repr demo:

In [4]: pd.col('value')
Out[4]: col('value')

In [5]: pd.col('value') * pd.col('weight')
Out[5]: (col('value') * col('weight'))

In [6]: (pd.col('value') - pd.col('value').mean()) / pd.col('value').std()
Out[6]: ((col('value') - col('value').mean()) / col('value').std())

In [7]: pd.col('timestamp').dt.strftime('%B')
Out[7]: col('timestamp').dt.strftime('%B')
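
To make the semantics concrete, here is a minimal, illustrative sketch of how an expression object along these lines could work. It is not the implementation in this PR (the Expr class and every detail below are placeholders for illustration only); the point is just that an expression records operations lazily, evaluates to a Series when called with a DataFrame, and therefore slots in wherever a DataFrame -> Series callable is accepted:

import operator

import pandas as pd


class Expr:
    """Lazily-built column expression: a DataFrame -> Series callable with a readable repr."""

    def __init__(self, func, repr_str):
        self._func = func      # DataFrame -> Series (or scalar)
        self._repr = repr_str

    def __call__(self, df):
        # Evaluating the expression is just applying the recorded function.
        return self._func(df)

    def __repr__(self):
        return self._repr

    def _binop(self, other, op, symbol):
        # Combine with another expression or with a plain value/Series.
        other_func = other._func if isinstance(other, Expr) else (lambda df, o=other: o)
        other_repr = repr(other)
        return Expr(
            lambda df: op(self._func(df), other_func(df)),
            f"({self._repr} {symbol} {other_repr})",
        )

    def __add__(self, other):
        return self._binop(other, operator.add, "+")

    def __sub__(self, other):
        return self._binop(other, operator.sub, "-")

    def __gt__(self, other):
        return self._binop(other, operator.gt, ">")

    def __getattr__(self, name):
        # Forward plain Series method calls lazily, e.g. col('a').abs() or col('a').mean().
        # (The real proposal also covers the .dt/.str accessors; omitted here for brevity.)
        def method(*args, **kwargs):
            args_repr = ", ".join([*map(repr, args), *(f"{k}={v!r}" for k, v in kwargs.items())])
            return Expr(
                lambda df: getattr(self._func(df), name)(*args, **kwargs),
                f"{self._repr}.{name}({args_repr})",
            )

        return method


def col(name):
    # Root expression: select a column by name.
    return Expr(lambda df: df[name], f"col({name!r})")


df = pd.DataFrame({"a": [1, -2, 3], "b": [4, 5, 6]})
# Because an Expr is callable, DataFrame.assign and .loc already accept it today.
print(df.assign(a_centered=col("a") - col("a").mean()).loc[col("a").abs() > 1])

A real implementation would of course need to forward the full Series API, the accessor namespaces, and the reflected operators; the sketch only shows the shape of the idea.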

What's here should be enough for it to be usable. For the type hints to show up correctly, extra work will be needed in pandas-stubs, but I think it should be possible to develop tooling to automate the Expr docs and types based on the Series ones (going to cc @Dr-Irv here too, then).

As for the "col" name, that's what PySpark, Polars, Daft, and DataFusion use, so I think it'd make sense to follow the convention.


I'm opening this as a request for comments. Would people want this API to be part of pandas?

One of my main motivations for introducing it is that it avoids common scoping issues. For example, if you use assign to increment two columns' values by 10 and write df.assign(**{col: lambda df: df[col] + 10 for col in ('a', 'b')}), then you'll be in for a big surprise:

In [19]: df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

In [20]: df.assign(**{col: lambda df: df[col] + 10 for col in ('a', 'b')})
Out[20]:
    a   b
0  14  14
1  15  15
2  16  16

whereas with pd.col, you get what you were probably expecting:

In [4]: df.assign(**{col: pd.col(col) + 10 for col in ('a', 'b')})
Out[4]: 
    a   b
0  11  14
1  12  15
2  13  16
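
The surprise above is standard Python late binding: every lambda in the dict comprehension closes over the same col variable, which is 'b' by the time assign calls them. For completeness, the usual lambda-side workaround (nothing specific to this proposal) is to bind the loop variable through a default argument, which works but is easy to forget:

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Bind `col` eagerly via a default argument so each lambda captures its own column name
df.assign(**{col: (lambda df, col=col: df[col] + 10) for col in ('a', 'b')})
#     a   b
# 0  11  14
# 1  12  15
# 2  13  16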

Further advantages:

  • expressions are introspectable, so the repr can be made to look nice, whereas an anonymous lambda will always look something like <function __main__.<lambda>(df)>
  • the syntax looks more modern and more aligned with modern tools

Expected objections:

  • this expands the pandas API even further. I don't disagree, but I think this is a common and long-standing enough request that it's worth expanding the API for it

TODO:

  • tests, API docs, user guide. But first, I just wanted to get a feel for people's thoughts, and to see if anyone's opposed to it

@MarcoGorelli changed the title from "ENH: Introduce pandas.col" to "RFC: Introduce pandas.col" on Aug 13, 2025
@Dr-Irv (Contributor) commented Aug 13, 2025

For the type hints to show up correctly, extra work should be done in pandas-stubs. But, I think it should be possible to develop tooling to automate the Expr docs and types based on the Series ones (going to cc @Dr-Irv here too then)

When this is added, and then released, pandas-stubs can be updated with proper stubs.

One comment is that I'm not sure it will support some basic arithmetic, such as:

result = df.assign(addcon=pd.col("a") + 10)

Or alignment with other series:

b = df["b"]  # or this could be from a different DF
result = df.assign(add2=pd.col("a") + b)

Also, don't you need to add some tests??

@MarcoGorelli (Member, Author) commented:

Thanks for taking a look!

One comment is that I'm not sure it will support some basic arithmetic [...] Or alignment with other series:

Yup, they're both supported:

In [8]: df = pd.DataFrame({'a': [1,2,3]})

In [9]: s = pd.Series([90,100,110], index=[2,1,0])

In [10]: df.assign(
    ...:     b=pd.col('a')+10,
    ...:     c=pd.col('a')+s,
    ...: )
Out[10]: 
   a   b    c
0  1  11  111
1  2  12  102
2  3  13   93
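
For what it's worth, the c column shows that the addition aligns on the index exactly as plain Series arithmetic would (s has index [2, 1, 0], hence the reversed values); the existing lambda spelling gives the same aligned result:

# For comparison: produces the same c column as pd.col('a') + s above
df.assign(c=lambda df: df['a'] + s)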

Also, don't you need to add some tests??

😄 Definitely, I just wanted to test the waters first, as I think this would be perceived as a significant API change

@Dr-Irv (Contributor) commented Aug 13, 2025

Definitely, I just wanted to test the waters first, as I think this would be perceived as a significant API change

I don't see it as a "change"; it's more an addition to the API that makes it easier to use. The existing way of using df.assign(foo=lambda df: df["a"] + df["b"]) would still work, but df.assign(foo=pd.col("a") + pd.col("b")) is cleaner.

@jbrockmendel (Member) commented:

Is assign the main use case?

@MarcoGorelli (Member, Author) commented:

Currently it would only work in places that accept DataFrame -> Series callables, which, as far as I know, means only DataFrame.assign and filtering with DataFrame.loc.

Getting it to work in GroupBy.agg is more complex, but it is possible, albeit with some restrictions.
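
To illustrate, filtering with .loc already accepts a DataFrame -> Series callable today, and (as shown in the demo at the top) an expression would be sugar for exactly that kind of callable:

# Both select the rows where column "a" exceeds its mean
df.loc[lambda df: df["a"] > df["a"].mean()]
df.loc[pd.col("a") > pd.col("a").mean()]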

@MarcoGorelli marked this pull request as ready for review on August 14, 2025 at 10:09